Hello world
7
Inter-American Development Bank Workshop
2025-11-23
Why?
Python is useful for its speed, reproducibility, flexibility, and ecosystem of libraries
Integrates with R, SQL, Excel, Stata, and APIs
Scale: working with a small dataset? Good. Working with Big Data in the cloud? Perfect
Free
Why: Anaconda bundles Python, conda (env manager), Jupyter, and many data science packages.
Go to the Anaconda download page (choose Python 3.x installer) and download for your OS.
Run the installer and follow the prompts (accepting the default options is fine for most users).
On Windows, open a terminal by searching for Anaconda Prompt.
Open a terminal / Anaconda Prompt and run:
Why: Keeps project dependencies isolated (safer for reproducibility and error control).
Why: A new environment starts clean, and you choose which dependencies and libraries to install for that project.
Some packages are not available from conda channels. Install those with pip:
Tip: Prefer conda install (faster, fewer build issues); use pip only when needed.
(Some packages are harder to install. You may need to download a prebuilt wheel.)
Create a notebook using the Python kernel. We are ready to run our first program:
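A minimal first program could look like the sketch below. The variable name `message` is an illustrative assumption, not something taken from the slides.

```python
# Store a greeting in a variable and print it.
message = "Hello world"
print(message)
```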
Variables store information for later use.
Bahamas 12500
0.032 True
Warning: Variables are stored in your RAM
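As a sketch of how the values shown above could be stored and printed, assuming illustrative variable names (`country`, `gdp`, `growth`, `is_island`) that do not come from the slides:

```python
# Assign values to names with "="; the names here are hypothetical.
country = "Bahamas"   # text
gdp = 12500           # whole number
growth = 0.032        # decimal
is_island = True      # True/False flag

# print() accepts several values and separates them with a space.
print(country, gdp)
print(growth, is_island)
```

The variables live in memory only while the session is running, so they disappear when the kernel restarts.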
<class 'int'> <class 'float'> <class 'str'> <class 'bool'>
Common types:
- int: whole numbers
- float: decimals
- str: text
- bool: True/False
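The built-in `type()` function reports the type of any value. A short sketch, reusing the example values from this section, with a conversion between types added as an illustration:

```python
# Ask Python for the type of each value.
print(type(12500), type(0.032), type("Bahamas"), type(True))

# Values can be converted between types when the conversion makes sense.
gdp_text = "12500"          # a number stored as text
gdp_number = int(gdp_text)  # now a whole number we can do math with
print(gdp_number + 400)
```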
The country Bahamas has a GDP of 12500
and a growth rate of 3.2%.
Bahamas grew 3.2% and now has a GDP of 12900.0
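Sentences like the ones above can be built with f-strings, which insert variable values into text. This is a sketch under assumed variable names; the rounding step is an addition that keeps floating-point noise out of the printed number.

```python
country = "Bahamas"
gdp = 12500
growth = 0.032

# {name} inserts a value; ":.1f" formats a float with one decimal.
line1 = f"The country {country} has a GDP of {gdp}"
line2 = f"and a growth rate of {growth * 100:.1f}%."

# ":.1%" multiplies by 100 and appends the percent sign.
new_gdp = round(gdp * (1 + growth), 1)
line3 = f"{country} grew {growth:.1%} and now has a GDP of {new_gdp}"

print(line1)
print(line2)
print(line3)
```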
# It is good practice to comment your code.
# We first define the variables
gdp_1 = 12_500
gdp_2 = 12_900
# And we can now calculate the growth rate:
growth = gdp_2 / gdp_1 - 1
print(f"The growth rate was {growth*100}%.")
# Or round when printing
print(f"The growth rate was {round(growth*100, 2)}%.")
# Or round the growth rate first (this loses precision: 0.032 becomes 0.03)
growth = round(growth, 2)
print(f"The growth rate was {growth*100}%.")

The growth rate was 3.200000000000003%.
The growth rate was 3.2%.
The growth rate was 3.0%.
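Python includes comparison operators (>, <, >=, <=, !=, ==), membership and identity tests (in, is), and the logical operator not. All of them return booleans (True/False). The ~ operator is a bitwise NOT; pandas uses it to negate boolean columns element by element.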
True
False
False
True
False
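The sketch below shows a few of these operators in action. The values are illustrative assumptions and are not meant to reproduce the exact output shown above.

```python
gdp = 12500
countries = ["Bahamas", "Brazil", "Mexico"]
missing = None

is_large = gdp > 10000              # True: 12500 is greater than 10000
is_equal = gdp == 12900             # False: the values differ
is_different = gdp != 12900         # True
has_chile = "Chile" in countries    # False: not in the list
is_missing = missing is None        # True: identity check against None
not_negative = not gdp < 0          # True: negates the comparison

print(is_large, is_equal, is_different, has_chile, is_missing, not_negative)
```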
We can use if, elif and else with the operators above.
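A minimal sketch of if, elif and else, using a hypothetical helper `classify_growth` and thresholds chosen only for illustration:

```python
def classify_growth(growth):
    """Label a growth rate; the thresholds are illustrative assumptions."""
    if growth > 0.03:
        return "fast"
    elif growth >= 0:
        return "slow"
    else:
        return "contraction"

print(classify_growth(0.032))
print(classify_growth(0.01))
print(classify_growth(-0.02))
```

Python checks the conditions from top to bottom and runs only the first branch that is true.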
Data structures organize and store multiple values efficiently.
They are different from simple variables that hold only one value.
| Structure | Example | Key Features |
|---|---|---|
| List | [1, 2, 3] | Ordered, mutable |
| Tuple | (1, 2, 3) | Ordered, immutable |
| Dictionary | {"name": "Ana", "age": 30} | Key-value pairs |
| Set | {1, 2, 3} | Unordered, unique elements |
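The four structures in the table can be tried out as follows. This is a small sketch that reuses the table's example values; the variable names are assumptions.

```python
# List: ordered and mutable, so items can be added or changed.
numbers = [1, 2, 3]
numbers.append(4)

# Tuple: ordered but immutable; trying to change it raises TypeError.
point = (1, 2, 3)
try:
    point[0] = 9
    tuple_changed = True
except TypeError:
    tuple_changed = False

# Dictionary: look up and update values by key.
person = {"name": "Ana", "age": 30}
person["age"] = 31

# Set: duplicates are dropped automatically.
unique = set([1, 2, 2, 3, 3])

print(numbers, point, person, unique)
```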
for: Iterates over a sequence (list, tuple, dictionary, string, range).
while: Repeats as long as a condition is true.

Count is: 0
Count is: 1
Count is: 2
Count is: 3
Count is: 4
Some useful statements: break (stops the loop), continue (skips code on that iteration), pass (does nothing/placeholder)
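The loop types and statements above can be sketched as follows. The `while` loop prints the same kind of "Count is" lines shown earlier; the list `values` is an illustrative assumption.

```python
# while: repeat as long as the condition holds.
count = 0
while count < 5:
    print(f"Count is: {count}")
    count += 1

# for with continue and break.
values = [200, -15, 100, 0, 50]
positives = []
for v in values:
    if v < 0:
        continue   # skip negative values and move to the next item
    if v == 0:
        break      # stop the loop entirely at the first zero
    positives.append(v)

# pass: a placeholder where a statement is required but nothing should happen.
for v in values:
    pass

print(positives)
```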
A function is a reusable block of code that performs a specific task.
It helps you avoid repetition and organize your program.
Hello, Diego
return sends a value back from the function to the place where the function was called.
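A minimal sketch of a function with a return value. The name `greet` is an assumption chosen to match the greeting shown above.

```python
def greet(name):
    """Build a greeting and send it back to the caller."""
    return f"Hello, {name}"

# The returned value can be stored in a variable or printed directly.
greeting = greet("Diego")
print(greeting)
```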
Suppose you want to run a model and save the results in a table, then run a different model and build another table. A function lets you write the table code once and reuse it.
def make_table(values):
    """Return a list of (index, value) pairs as a simple table."""
    table = []
    for i, v in enumerate(values):
        table.append((i, v))
    return table

data = [200, -15, 100]
result = make_table(data)
print("Index | Value")
print("--------------")
for row in result:
    print(f"{row[0]:5} | {row[1]:6}")

Index | Value
--------------
    0 |    200
    1 |    -15
    2 |    100
Table-like structure (pandas library) with rows/columns
Pandas supports many ways to import data.
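As a sketch of importing data, the example below reads CSV text from memory with `io.StringIO` so that it runs without any file on disk. The in-memory text is an assumption for illustration; in practice you would pass a file path instead.

```python
import io
import pandas as pd

# CSV content held in a string instead of a file (illustrative data).
csv_text = (
    "Country,GDP,Population\n"
    "Brazil,2.1,214\n"
    "Bahamas,0.015,0.401\n"
    "Mexico,1.8,130\n"
)
df = pd.read_csv(io.StringIO(csv_text))
print(df)

# Other readers follow the same pattern, for example:
# pd.read_excel("file.xlsx"), pd.read_json("file.json"),
# pd.read_parquet("file.parquet"), pd.read_stata("file.dta")
```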
Saving data is easy.
import pandas as pd
data = {
    "Country": ["Brazil", "Bahamas", "Mexico"],
    "GDP": [2.1, 0.015, 1.8],
    "Population": [214, 0.401, 130]
}
df = pd.DataFrame(data)
# The session_1_files folder must already exist
df.to_csv("session_1_files/example.csv", index=False)
df.to_csv("session_1_files/example.txt", sep="\t", index=False)
# Writing .xlsx files needs an Excel engine such as openpyxl
df.to_excel("session_1_files/example.xlsx", index=False)
df.to_json("session_1_files/example.json", orient="records")

JSON is a widely used text format that stores each record as a set of key-value pairs, much like a Python dictionary.
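To see how the "records" layout maps onto dictionaries, the sketch below writes the JSON to a string instead of a file and parses it back with the standard `json` module. Skipping the file path is an assumption made to keep the example self-contained.

```python
import json
import pandas as pd

df = pd.DataFrame({
    "Country": ["Brazil", "Bahamas", "Mexico"],
    "GDP": [2.1, 0.015, 1.8],
    "Population": [214, 0.401, 130],
})

# Without a path, to_json returns the JSON text.
json_text = df.to_json(orient="records")

# Each row becomes one dictionary in a list.
records = json.loads(json_text)
print(records[0])

# A list of dictionaries can be turned back into a DataFrame.
df_back = pd.DataFrame(records)
```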
| Format | Structure | Best for | Advantages | Limitations |
|---|---|---|---|---|
| CSV / TXT | Flat (rows & columns) | Tabular data, spreadsheets | Simple and lightweight; works everywhere; easy to inspect | No nested data; no metadata; loses data types |
| Excel (.xlsx) | Flat (cells, sheets) | Business / reports | Formatting, formulas; multiple sheets; familiar to most users | Slow; not ideal for automation |
| JSON | Hierarchical (nested) | Web data, APIs, configs | Stores complex/nested data; language-independent; standard for APIs | Larger files; harder to view in Excel |
| Parquet / Feather | Columnar (binary) | Big data, analytics | Very fast I/O; keeps data types; compressed & efficient | Not human-readable; requires a supporting library (e.g. in Python or R) |
| | date | exports USA | imports USA |
|---|---|---|---|
| 476 | 2024-09-30 | 36.066283 | 368.731802 |
| 477 | 2024-10-31 | 109.539796 | 305.212492 |
| 478 | 2024-11-30 | 219.298712 | 636.946114 |
| 479 | 2024-12-31 | 366.392066 | 520.629852 |
| 480 | 2025-01-31 | 65.009249 | 506.231056 |
import pandas as pd
df = pd.read_csv("session_1_files/uscensus_trade.csv")
print("====================")
print("Data info:")
df.info()  # Column names, data types, missing values (info() prints directly and returns None)
print("====================")
print("Summary stats:")
print(df.describe())  # Summary stats for numeric columns
print("====================")
print("Size:")
print(df.shape)  # (rows, columns)
print("====================")
print("Column/Variable names:")
print(df.columns)  # List of column names
print("====================")
print("Data types:")
print(df.dtypes)  # Data types of each column

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 481 entries, 0 to 480
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 481 non-null object
1 exports USA 481 non-null float64
2 imports USA 481 non-null float64
dtypes: float64(2), object(1)
memory usage: 11.4+ KB
| | exports USA | imports USA |
|---|---|---|
| count | 481.000000 | 481.000000 |
| mean | 45.783287 | 170.553382 |
| std | 44.591570 | 128.572069 |
| min | 5.500000 | 43.200000 |
| 25% | 23.500000 | 65.200000 |
| 50% | 34.800000 | 125.860780 |
| 75% | 52.100000 | 243.605530 |
| max | 381.395095 | 850.473691 |
Suppose your dataset has many columns and you want to see descriptive statistics of only one.
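One way to do this is to select the column first and then call `describe()`. The sketch uses a tiny made-up DataFrame; the column names `exports` and `imports` and the numbers are illustrative assumptions.

```python
import pandas as pd

# Made-up data for illustration only.
df = pd.DataFrame({
    "exports": [10.0, 20.0, 30.0],
    "imports": [5.0, 15.0, 40.0],
})

# Select one column, then summarize it.
stats = df["exports"].describe()
print(stats)

# A list of names selects several columns at once.
print(df[["exports", "imports"]].describe())
```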
When dealing with time-series it is a good idea to state the preferred index
Sometimes we need to make sure a column is read as a date. Use pd.to_datetime()
The column type: object
After transforming: datetime64[ns]
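A sketch of the conversion, followed by setting the date column as the index. The dates are taken from the trade table shown earlier; the `value` column is a made-up placeholder.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2024-11-30", "2024-12-31", "2025-01-31"],  # dates stored as text
    "value": [1.0, 2.0, 3.0],                             # placeholder numbers
})
print("The column type:", df["date"].dtype)

# Convert the text column into real dates.
df["date"] = pd.to_datetime(df["date"])
print("After transforming:", df["date"].dtype)

# Use the dates as the index, which suits time-series work.
df = df.set_index("date")
print(df.index)
```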
This becomes useful when merging and standardizing a dataset.
With a DatetimeIndex you can also slice by date and plot the result: df.loc["2020-01-01":"2025-01-01"].plot()
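Date slicing with `.loc` can be sketched as follows. The dates and values are made up for illustration, and the plot call is left as a comment so the example runs without a display.

```python
import pandas as pd

dates = pd.to_datetime(["2019-12-31", "2020-06-30", "2024-12-31", "2025-06-30"])
df = pd.DataFrame({"value": [1, 2, 3, 4]}, index=dates)

# Both ends of a label slice are included; the dates need not exist in the index.
window = df.loc["2020-01-01":"2025-01-01"]
print(window)

# window.plot() would draw the selected period.
```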
| Method | Typical Use | Based On | Example | Notes |
|---|---|---|---|---|
| pd.concat() | Stack datasets vertically (rows) or horizontally (columns) | Axis (0 = rows, 1 = cols) | pd.concat([df1, df2]) | Simple append; doesn’t match keys |
| pd.merge() | Combine on one or more columns (like SQL JOIN) | Key columns | pd.merge(df1, df2, on="id") | Most flexible; supports inner, left, right, outer joins |
| df.join() | Combine on index (row labels) | Index alignment | df1.join(df2) | Convenient when indices are meaningful |
| | id | value | score |
|---|---|---|---|
| 0 | 1 | A | NaN |
| 1 | 2 | B | NaN |
| 2 | 3 | C | NaN |
| 0 | 2 | NaN | 80.0 |
| 1 | 3 | NaN | 90.0 |
| 2 | 4 | NaN | 70.0 |
| | value | score |
|---|---|---|
| id | | |
| 1 | A | NaN |
| 2 | B | 80.0 |
| 3 | C | 90.0 |
| | value | score |
|---|---|---|
| id | | |
| 1 | A | NaN |
| 2 | B | 80.0 |
| 3 | C | 90.0 |
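The three result tables above can be approximated with the sketch below. The input frames `df1` and `df2` are reconstructed from the displayed output, and the left join is an assumption inferred from which rows appear in the results.

```python
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2, 3], "value": ["A", "B", "C"]})
df2 = pd.DataFrame({"id": [2, 3, 4], "score": [80, 90, 70]})

# concat: stacks rows without matching keys, so gaps are filled with NaN.
stacked = pd.concat([df1, df2])

# merge: matches rows on the "id" column; how="left" keeps every row of df1.
merged = pd.merge(df1, df2, on="id", how="left")

# join: matches on the index, so "id" is moved into the index first.
joined = df1.set_index("id").join(df2.set_index("id"))

print(stacked)
print(merged)
print(joined)
```

Passing `how="inner"` or `how="outer"` to `pd.merge` would instead keep only the matching ids or all ids from both frames.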
| | date | Type | Value |
|---|---|---|---|
| 0 | 1985-01-31 | exports USA | 154.400000 |
| 1 | 1985-02-28 | exports USA | 73.700000 |
| 2 | 1985-03-31 | exports USA | 47.700000 |
| 3 | 1985-04-30 | exports USA | 25.500000 |
| 4 | 1985-05-31 | exports USA | 38.300000 |
| ... | ... | ... | ... |
| 957 | 2024-09-30 | imports USA | 368.731802 |
| 958 | 2024-10-31 | imports USA | 305.212492 |
| 959 | 2024-11-30 | imports USA | 636.946114 |
| 960 | 2024-12-31 | imports USA | 520.629852 |
| 961 | 2025-01-31 | imports USA | 506.231056 |
962 rows × 3 columns
| Type | exports USA | imports USA |
|---|---|---|
| date | | |
| 1985-01-31 | 154.400000 | 59.000000 |
| 1985-02-28 | 73.700000 | 43.200000 |
| 1985-03-31 | 47.700000 | 43.400000 |
| 1985-04-30 | 25.500000 | 57.900000 |
| 1985-05-31 | 38.300000 | 68.000000 |
| ... | ... | ... |
| 2024-09-30 | 36.066283 | 368.731802 |
| 2024-10-31 | 109.539796 | 305.212492 |
| 2024-11-30 | 219.298712 | 636.946114 |
| 2024-12-31 | 366.392066 | 520.629852 |
| 2025-01-31 | 65.009249 | 506.231056 |
481 rows × 2 columns
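The two tables above look like long and wide layouts of the same trade data. Assuming they were produced by reshaping, a sketch with `melt` and `pivot` on the first two rows could be:

```python
import pandas as pd

# First two rows of the wide table shown above.
wide = pd.DataFrame({
    "date": pd.to_datetime(["1985-01-31", "1985-02-28"]),
    "exports USA": [154.4, 73.7],
    "imports USA": [59.0, 43.2],
})

# Wide to long: one row per date and series.
long_df = wide.melt(id_vars="date", var_name="Type", value_name="Value")

# Long to wide: one column per series, indexed by date.
wide_again = long_df.pivot(index="date", columns="Type", values="Value")

print(long_df)
print(wide_again)
```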
Transforms data into a different frequency (e.g. daily to monthly/quarterly)
import pandas as pd
import matplotlib.pyplot as plt
# Load the daily dataset and parse the "date" index as dates
df = pd.read_csv(
    "session_1_files/eia_brent.csv",
    parse_dates=True,
    index_col="date")
# Average the daily prices within each month
# (pandas 2.2+ prefers "ME" for month-end; "M" raises a deprecation warning there)
monthly_df = df.resample("M").mean()
plt.figure(figsize=(8,4))
plt.plot(df.index, df["brent price"], alpha=0.4, label="Daily")
plt.plot(monthly_df.index, monthly_df["brent price"], color="red", linewidth=2, label="Monthly")
plt.title("Daily vs. Monthly Brent price", fontsize=14, weight="bold")
plt.legend(frameon=False)
plt.show()

| Code | Frequency | Example Dates |
|---|---|---|
| “D” | Daily | 2020-01-01, 2020-01-02 |
| “W” | Week-End | 2020-01-05, 2020-01-12 |
| “M” | Month-End | 2020-01-31, 2020-02-29 |
| “MS” | Month-Start | 2020-01-01, 2020-02-01 |
| “Q” | Quarter-End | 2020-03-31, 2020-06-30 |
| “QS” | Quarter-Start | 2020-01-01, 2020-04-01 |
| “A” | Year-End | 2020-12-31, 2021-12-31 |
| “AS” | Year-Start | 2020-01-01, 2021-01-01 |
| Function | Description | Example |
|---|---|---|
| .mean() | Average over the period | df.resample("M").mean() |
| .sum() | Total over the period | df.resample("M").sum() |
| .last() | Last observation | df.resample("M").last() |
| .first() | First observation | df.resample("M").first() |
| .max() | Maximum value | df.resample("M").max() |
| .min() | Minimum value | df.resample("M").min() |
| .median() | Median value | df.resample("M").median() |
| .agg(["mean","max"]) | Multiple aggregations | df.resample("Q").agg(["mean","max"]) |
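To check what these aggregations do, the sketch below resamples a small synthetic daily series where every day in a month carries the month number as its value. The data is made up, and the "MS" (month-start) code is used because it behaves the same across pandas versions.

```python
import pandas as pd

# Daily dates for the first quarter of 2020 (a leap year).
idx = pd.date_range("2020-01-01", "2020-03-31", freq="D")

# Synthetic values: 1.0 in January, 2.0 in February, 3.0 in March.
daily = pd.Series(idx.month.to_numpy(dtype=float), index=idx)

monthly_mean = daily.resample("MS").mean()   # average per month
monthly_sum = daily.resample("MS").sum()     # total per month
monthly_both = daily.resample("MS").agg(["mean", "max"])

print(monthly_mean)
print(monthly_sum)
print(monthly_both)
```

The sums differ across months only because the months have 31, 29 and 31 days.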
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(42)
date_range = pd.date_range("2015-01-01", periods=60, freq="Q")
gdp = 1000 + np.cumsum(np.random.normal(5, 10, len(date_range)))
inflation = np.random.uniform(1.5, 4.0, len(date_range))
df = pd.DataFrame({"date": date_range, "GDP": gdp, "Inflation": inflation})
plt.figure(figsize=(8,5), facecolor="white")
gdp_color = "#1f77b4" # a strong blue
inflation_color = "#d62728" # a contrasting red
plt.plot(df["date"], df["GDP"], label="GDP (in billions)", linewidth=2.5, color=gdp_color)
plt.plot(df["date"], df["Inflation"]*250, label="Inflation (scaled)", linestyle="--", color=inflation_color)
plt.title("GDP and Inflation Over Time", fontsize=16, weight="bold")
plt.xlabel("Date")
plt.ylabel("Index / Scaled Value")
plt.legend(frameon=False)
plt.grid(alpha=0.3)
plt.gca().set_facecolor("white")
plt.show()

sectors = ["Agriculture", "Manufacturing", "Tourism", "IT", "Finance"]
gdp_share = [5, 25, 20, 30, 20]
plt.figure(figsize=(7,4))
bars = plt.bar(sectors, gdp_share, color=["#1f77b4","#ff7f0e","#2ca02c","#9467bd","#d62728"])
plt.title("GDP Share by Sector", fontsize=16, weight="bold")
plt.ylabel("Percent of Total GDP")
plt.grid(axis="y", alpha=0.3)
plt.show()

import seaborn as sns
np.random.seed(42)
data = pd.DataFrame({
"Sector": np.repeat(["Agriculture","Tourism","Finance","IT"], 50),
"Wage": np.concatenate([
np.random.normal(800, 100, 50),
np.random.normal(1200, 150, 50),
np.random.normal(2500, 300, 50),
np.random.normal(2200, 250, 50)
])
})
plt.figure(figsize=(8,5))
sns.boxplot(data=data, x="Sector", y="Wage", palette="Set2")
plt.title("Wage Distribution by Sector", fontsize=16, weight="bold")
plt.ylabel("Monthly Wage (USD)")
plt.grid(axis="y", alpha=0.3)
plt.show()

gdp = np.random.uniform(1000, 4000, 40)
employment = gdp*0.04 + np.random.normal(0, 50, 40)
plt.figure(figsize=(7,5))
plt.scatter(gdp, employment, s=80, c="#1f77b4", alpha=0.7, edgecolors="white", linewidths=1)
plt.title("GDP vs Employment", fontsize=16, weight="bold")
plt.xlabel("GDP (in millions)")
plt.ylabel("Employment (in thousands)")
plt.grid(alpha=0.3)
plt.show()

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('seaborn-v0_8')
np.random.seed(42)
# --- Create a fake dataset ---
sectors = ["Agriculture", "Tourism", "Finance", "IT"]
data = pd.DataFrame({
"Sector": np.repeat(sectors, 80),
"Wage": np.concatenate([
np.random.normal(800, 100, 80),
np.random.normal(1200, 150, 80),
np.random.normal(2500, 300, 80),
np.random.normal(2200, 250, 80)
]),
"Hours": np.concatenate([
np.random.normal(45, 5, 80),
np.random.normal(42, 4, 80),
np.random.normal(38, 3, 80),
np.random.normal(40, 4, 80)
])
})
# --- Create figure with subplots ---
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
# Plot 1: Violin plot for Wage distribution
sns.violinplot(data=data, x="Sector", y="Wage", ax=ax1, palette="Set2", inner="box")
ax1.set_title("Wage Distribution by Sector", fontsize=14, weight="bold")
ax1.set_xlabel("")
ax1.set_ylabel("Monthly Wage (USD)")
ax1.grid(alpha=0.3)
# Plot 2: Box plot for Working Hours
sns.boxplot(data=data, x="Sector", y="Hours", ax=ax2, palette="Set3")
ax2.set_title("Working Hours by Sector", fontsize=14, weight="bold")
ax2.set_xlabel("")
ax2.set_ylabel("Weekly Hours")
ax2.grid(alpha=0.3)
plt.tight_layout()
plt.show()